Progress Memo 1
Final Project
Data Science 1 with R (STAT 301-1)
Data source
I will be using the Economic Policy Institute’s data on family budget, which also then corresponding to telling us about the cost of living in each county in America. This dataset provides insights into the average economical weights of different aspects of life for each county in America both annually and monthly, whiles dividing these averages further by also looking at the different family types as well. I have provided an additional copy of the same dataset, with less columns and information from Kaggle. This in case that the analysis that in which I wish to conduct starts to not follow the overall larger EPI dataset. Thus turning towards the smaller one from Kaggle where I can focus in more on each variable, looking only at the annual costs and not both monthly and annually as in the larger EPI provided dataset. I additionally wanted to join this dataset with others that provide information on the minimum wage in each state and perhaps the racial majorities of each county, but I was unable to find suitable datasets, thus I will be adding these variables on myself during the first stages of my EDA. 1
Why this data
The reason I choose this dataset is because I thought it would be interesting to see how the cost of living differs not only on the level of state vs state, but looking further into the issue by seeing how cost of living differs by family size and county location. Additionally, I think this would be an opportunity to learn how the incomes levels and expenses in each county and state differ from each other, while also starting to understand why these differences are present. I further think that by looking additionally at minimum wage in each state, it will bring attention to a lot of the inequity present in the states. Thinking on this now, I could also create a variable which states what the majority race or ethnicity in each county is as well, I think this would be an interesting statistical insight to find and correlate to expense levels in the different counties and states.
Data quality & complexity check
Family Budget Dataset
Family Budget Dataset Codebook
Cost of Living Dataset
Skimming Datasets
Family Budget Dataset
| Name | Piped data |
| Number of rows | 31430 |
| Number of columns | 27 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 23 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state_abv | 0 | 1 | 2 | 2 | 0 | 51 | 0 |
| areaname | 0 | 1 | 12 | 183 | 0 | 2561 | 0 |
| county | 0 | 1 | 10 | 33 | 0 | 1877 | 0 |
| family | 0 | 1 | 4 | 4 | 0 | 10 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| case_id | 0 | 1 | 1589.31 | 917.22 | 1.00 | 792.00 | 1593.00 | 2386.00 | 3171.00 | ▇▇▇▇▇ |
| metro | 0 | 1 | 0.37 | 0.48 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| housing_monthly | 0 | 1 | 922.81 | 347.13 | 350.78 | 715.00 | 868.00 | 1037.00 | 5144.63 | ▇▁▁▁▁ |
| food_monthly | 0 | 1 | 690.63 | 272.60 | 185.02 | 483.45 | 677.43 | 891.97 | 2598.22 | ▇▇▁▁▁ |
| transportation_monthly | 0 | 1 | 1132.82 | 136.70 | 184.71 | 1044.60 | 1141.51 | 1230.48 | 1651.37 | ▁▁▃▇▁ |
| healthcare_monthly | 0 | 1 | 1116.17 | 433.71 | 289.70 | 805.62 | 1090.22 | 1388.15 | 3104.36 | ▅▇▃▁▁ |
| other_necessities_monthly | 0 | 1 | 584.61 | 199.78 | 217.64 | 440.53 | 561.09 | 701.09 | 2402.45 | ▇▅▁▁▁ |
| childcare_monthly | 0 | 1 | 823.30 | 564.85 | 0.00 | 445.14 | 847.20 | 1189.70 | 4069.26 | ▇▇▁▁▁ |
| taxes_monthly | 0 | 1 | 638.14 | 278.32 | 85.65 | 466.50 | 574.87 | 732.52 | 3979.45 | ▇▁▁▁▁ |
| total_monthly | 0 | 1 | 5908.47 | 1820.55 | 2507.31 | 4481.33 | 5914.81 | 7114.28 | 18643.13 | ▇▇▁▁▁ |
| housing_annual | 0 | 1 | 11073.67 | 4165.61 | 4209.31 | 8580.00 | 10416.00 | 12444.00 | 61735.59 | ▇▁▁▁▁ |
| food_annual | 0 | 1 | 8287.50 | 3271.14 | 2220.28 | 5801.42 | 8129.16 | 10703.62 | 31178.62 | ▇▇▁▁▁ |
| transportation_annual | 0 | 1 | 13593.86 | 1640.46 | 2216.46 | 12535.16 | 13698.16 | 14765.76 | 19816.48 | ▁▁▃▇▁ |
| healthcare_annual | 0 | 1 | 13394.03 | 5204.55 | 3476.38 | 9667.44 | 13082.70 | 16657.82 | 37252.27 | ▅▇▃▁▁ |
| other_necessities_annual | 0 | 1 | 7015.32 | 2397.42 | 2611.64 | 5286.35 | 6733.06 | 8413.09 | 28829.44 | ▇▅▁▁▁ |
| childcare_annual | 0 | 1 | 9879.58 | 6778.22 | 0.00 | 5341.62 | 10166.34 | 14276.38 | 48831.09 | ▇▇▁▁▁ |
| taxes_annual | 0 | 1 | 7657.71 | 3339.80 | 1027.80 | 5597.97 | 6898.47 | 8790.21 | 47753.39 | ▇▁▁▁▁ |
| total_annual | 0 | 1 | 70901.68 | 21846.55 | 30087.66 | 53776.02 | 70977.68 | 85371.34 | 223717.55 | ▇▇▁▁▁ |
| median_family_income | 10 | 1 | 68316.00 | 16886.97 | 25529.98 | 57223.99 | 65955.61 | 76136.07 | 177662.47 | ▂▇▁▁▁ |
| num_counties_in_st | 10 | 1 | 96.51 | 56.88 | 1.00 | 64.00 | 87.00 | 105.00 | 254.00 | ▂▇▂▁▁ |
| st_cost_rank | 10 | 1 | 48.77 | 43.05 | 1.00 | 18.00 | 39.00 | 66.00 | 254.00 | ▇▃▁▁▁ |
| st_med_aff_rank | 10 | 1 | 48.74 | 43.08 | 1.00 | 18.00 | 39.00 | 66.00 | 254.00 | ▇▃▁▁▁ |
| st_income_rank | 10 | 1 | 48.77 | 43.05 | 1.00 | 18.00 | 39.00 | 66.00 | 254.00 | ▇▃▁▁▁ |
Cost of Living Dataset
| Name | Piped data |
| Number of rows | 31430 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| logical | 1 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state | 0 | 1 | 2 | 2 | 0 | 51 | 0 |
| areaname | 0 | 1 | 12 | 183 | 0 | 2561 | 0 |
| county | 0 | 1 | 10 | 33 | 0 | 1877 | 0 |
| family_member_count | 0 | 1 | 4 | 4 | 0 | 10 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| isMetro | 0 | 1 | 0.37 | FAL: 19730, TRU: 11700 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| case_id | 0 | 1 | 1589.31 | 917.22 | 1.00 | 792.00 | 1593.00 | 2386.00 | 3171.00 | ▇▇▇▇▇ |
| housing_cost | 0 | 1 | 11073.67 | 4165.61 | 4209.31 | 8580.00 | 10416.00 | 12444.00 | 61735.59 | ▇▁▁▁▁ |
| food_cost | 0 | 1 | 8287.50 | 3271.14 | 2220.28 | 5801.42 | 8129.16 | 10703.62 | 31178.62 | ▇▇▁▁▁ |
| transportation_cost | 0 | 1 | 13593.86 | 1640.46 | 2216.46 | 12535.16 | 13698.16 | 14765.76 | 19816.48 | ▁▁▃▇▁ |
| healthcare_cost | 0 | 1 | 13394.03 | 5204.55 | 3476.38 | 9667.44 | 13082.70 | 16657.82 | 37252.27 | ▅▇▃▁▁ |
| other_necessities_cost | 0 | 1 | 7015.32 | 2397.42 | 2611.64 | 5286.35 | 6733.06 | 8413.09 | 28829.44 | ▇▅▁▁▁ |
| childcare_cost | 0 | 1 | 9879.58 | 6778.22 | 0.00 | 5341.62 | 10166.34 | 14276.38 | 48831.09 | ▇▇▁▁▁ |
| taxes | 0 | 1 | 7657.71 | 3339.80 | 1027.80 | 5597.97 | 6898.47 | 8790.21 | 47753.39 | ▇▁▁▁▁ |
| total_cost | 0 | 1 | 70901.68 | 21846.55 | 30087.66 | 53776.02 | 70977.68 | 85371.34 | 223717.55 | ▇▇▁▁▁ |
| median_family_income | 10 | 1 | 68316.00 | 16886.97 | 25529.98 | 57223.99 | 65955.61 | 76136.07 | 177662.47 | ▂▇▁▁▁ |
The fbc_data there are 27 variables and 31,430 observations. Within this, there are twenty-three numerical variables and four categorical variables. However, I want to change the metro variable from being a numerical variable with only 0 and 1s to that of a logical later on. Five variables, which all pertain to different rankings and extra information, share the same amount of missing data. However, I feel that since it is such a small percentage, with each missing only 10 values (0.3181) respectively of their portion of the data, I will see how I can adapt or change my analysis to either ignore or get rid of this missingness altogether.
Looking at the smaller version of the data, which looks only at the annual costs has 15 variables and 31430 observations. Within this, there are five categorical variables and ten numerical variables. There is only one variable which has missing values, being the median_family_income variable. It is missing ten instances (0,3181 of the data), thus since it is a small number I will see how I can operate around this missingness. This variable was also one of the five missing in the larger dataset, thus this helps to prove that this dataset is the same as the larger fbc_data dataset, but just showcasing specific variables.
Potential data issues
I think the main issues that I will face is being able to figure out how to navigate looking at all the different levels that my data is assessing, relating to the different family types, different counties, different states, and how they differ between annual and monthly cost.
Additionally, I feel that I might struggle slightly with wanting to add in my own variables of minimum wage and the racial composition or racial majority of each county into my dataset and analysis.
Furthermore, I have to look more into where the missing data for the variables lie, and how I can navigate around them in order to still provide similar and consistent comparisons for each county and family type.
Misc
As I have been discussing above, I plan to add other information such as minimum wage for each state and the ethnic makeup or majority in each county, if I am able to find this for each or most counties in America. I plan to add the minimum wage through using the case when function, however I am unsure of how I want to add the ethnic makeup variable to my dataset. Currently, I think I will make my own dataset in excel, and then join it together with my current dataset.
Furthermore, I currently feel that I want to look at how minimum wage of each state corresponds with the annual cost in each county and how being in a metro area makes this either higher or lower.
Thinking about this, I was wondering if I should look to see if different metro areas have different minimum wages than the state as a whole as well.
Footnotes
These datasets were sourced from EPI Family Budget Map & Kaggle US Cost of Living Dataset.↩︎